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1. PROLOGUE 

The invitation for this discussion contribution came 
at the busiest time in my (professional) life with 
four courses and many more meetings attempting 
to compensate, psychologically, for the lost endow- 
ment at Harvard. I could not possibly, however, de- 
cline David Madigan's kind invitation. The topic is 
dear to my heart, as it should be to any statisti- 
cian's, for without "unobservables," we would be 
unemployable. And I always wanted to know what 
"h-likelihood" is! I first heard the term from my aca- 
demic twin brother, Andrew Gelman, who sent me 
his discussion of Lee and Nelder (1996). Gelman's 
conclusion was that "To the extent that the meth- 
ods in this paper give different answers from the 
full Bayesian treatment, I would trust the latter." 
This of course did not entice me to read the pa- 
per. Indeed, I still did not know its definition when 
I started to type this Prologue, nor have I had any 
professional or personal contact with either author. 
I surmise this qualifies me as an objective discussant, 
though I hope in this case the phrase objective is not 
exchangeable with noninformative or ignorant! 

But surely, one may quibble, Gelman's comment 
must have influenced me. True, but I'm not the kind 
of Bayesian who is unwilling to change his/her prior. 
My pure interest is to decode the h-likelihood. If my 
brother is right, I'll be more proud of him. If he is 
wrong, I'll be wiser by learning something new. (But 
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I do ask Professors Lee and Nelder for their toler- 
ance as I try to follow my brother's critical style, 
in the name of good discussion!) So here I am, set- 
ting aside the 72-hour Memorial Day weekend, after 
persuading my teenagers that their father's H-bomb 
mystery is more urgent to solve than his colleague 
Dr. Langdon's prevention of the antimatter explo- 
sion in 24 hours, which actually repeats every week- 
end. 

2. PREPARING FOR A BAYESIAN 
INFERENCE OF H-LIKELIHOOD 

2.1 Prior Formulation 

Naturally I will adopt a Bayesian approach to in- 
fer what is the real "H" in the h-likelihood. What 
could it actually stand for? (I) Heuristic argument? 
(II) Handy approximation? (Ill) Hybrid method? 
Or even (IV) Hidden treasure? Of course, a pri- 
ori I would not be a good Bayesian if I exclude 
"(V) Hype?" no matter how small my prior belief in 
it. Gelman's comment led me to assign the highest 
prior probability to (III), 60%. Since the events here 
are clearly not mutually exclusive, (I) and (II) also 
deserve some nontrivial prior probabilities which are 
40% each for reasons I can only explain to myself. 
But for reasons everyone can explain, the prior prob- 
abilities for categories such as (IV) or (V) are best 
kept confidential, other than that they of course de- 
pend on one's knowledge of the author(s) and the 
journal. 

2.2 Data Collection 

Immediately, I ran into the usual problem of any 
real-life data collection — there are never enough time 
or resources! It is already 2:31 pm Saturday as I 
am typing this sentence, and I yet need to read 
the paper plus four reference papers I was able to 
download from JSTOR: two discussion papers by 
the same authors (Lee and Nelder, 1996, 2005) and 
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the two papers in Biometrika that illustrate the 
use of h-likelihood (Ha, Lee and Song, 2001, Lee 
and Nelder, 2001). Lee, Nelder and Pawitan's (2006) 
book of course would be invaluable which, unfor- 
tunately, turns out to be literally true in this case 
because apparently no Harvard library can afford it. 

So I settle with these four papers as background, 
knowing well the potential bias due to my haphazard 
selection and all the "unobservables" to me at this 
moment. Hence my apologies to the authors — and 
readers — in advance. To compensate for my hasti- 
ness, I'll actually read all five papers, and the discus- 
sions, before forming my likelihood, with or without 
"H"! 

2.3 Data Processing 

Another grand challenge in real-life statistical anal- 
ysis is data processing, something that unfortunately 
has not received nearly enough systematic treatment 
in the literature but which typically can have a sub- 
stantial, if not detrimental, impact on the final con- 
clusions. One key component in data processing is 
to sort out contradictions in the data, some obvious 
and some subtle. 

A priori I did not expect this to be a part of the 
mystery that would await me. But that prior be- 
lief quickly shrank to e after reading the first para- 
graph. The authors started by emphasizing Pear- 
son's (1920) point that Fisher's likelihood is not use- 
ful for predicting future observations or unobserv- 
ables. Regardless of whether Fisher ever had such 
an intention, this is an inference/prediction issue. 
The authors then immediately stated that existing 
efforts in generalizing Fisher's likelihood inferences 
with unobservables run into the problem of not hav- 
ing "explicit forms" due to the difficult in integra- 
tion. But that is squarely a computational/ calculus 
issue. Putting aside the vast literature on the EM 
algorithm and related computational methods that 
have successfully dealt with this very computational 
issue in many common applications (see the overview 
by van Dyk and Meng (2010) and other papers in 
the coming theme issue on EM in this journal), I am 
mystified by the logic and aims here — which issue do 
the authors intend to address? Both? 

Of course this could actually be a sign of a great 
mystery novel, enticing the reader from the very 
beginning, with multiple seemingly related or unre- 
lated lines to pursue, and a Holy Grail at the end — 
a gigantic H! (Clearly I am still in my Dan Brown 



mood, though hopefully this time the Holy Grail is 
more than a legend.) 

The data processing indeed took much longer than 
I expected, mainly because the "unobservables" that 
I need to infer, from a number of mystic symbols 
whose meaning can only be surmised retrospectively 
to reasons that can explain the authors' conviction 
that their h-likelihood methods have been misunder- 
stood by almost all the discussants, since Lee and 
Nelder (1996). 

It is already 6:39 pm, Sunday. So let me get to the 
three main storylines as I comprehend. The first two 
lines are generally well understood, so I shall reflect 
on them briefly. The third line, which is the most 
controversial, namely, h-likelihood inference for un- 
observables, touches upon some fundamental issues 
about statistical inference and prediction, and turns 
out to have at least one unexpected intriguing prop- 
erty, at least to me. Therefore, the rest (three quar- 
ters of the) discussion attempts to provide an ex- 
planation of this controversy to a general audience, 
along with some ramifications and thoughts it gen- 
erates. Indeed, if a reader is in a rush to catch An- 
gles and Demons, as my teenagers were, the reader 
should just skip the following section, which con- 
tains no real enlightenment or entertainment, other 
than some shameless self-advertisements and aca- 
demic quibbles. 

3. TWO UNCONTROVERSIAL STORYLINES 

3.1 Line One: Unobservables are Useful 
for Modeling 

Much of the authors' Section 1 and Section 2 were 
devoted to arguing and demonstrating the useful- 
ness of unobservables for statistical modeling. Other 
than the authors' preference for using unobservables 
as the all-encompassing term instead of the more 
common term missing data (though I agree that 
"unobservables" is semantically more appropriate), 
the same message has been repeatedly emphasized 
in the literature, and it is indeed worthy of repeat- 
ing. As I wrote in "Missing Data: Dial M for ???", 
a J AS A Y2K vignette (Meng, 2000), "The topic of 
missing data is as old and as extensive as statistics 
itself — after all, statistics is about knowing the un- 
knowns." Unable to outshine the summary there, 
I ask readers' indulgence for a more extensive self- 
quotation. Below is the opening paragraph of the 
same vignette, echoing well the authors' key em- 
phases, but with a more extended history (e.g., McK- 
endrick's missing-data modeling/formulation went 
back 1926; see Meng, 1997): 
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The question mark is common notation 
for the missing data that occur in most 
applied statistical analyses. Over the past 
century, statisticians and other scientists 
not only have invented numerous methods 
for handling missing/incomplete data, but 
also have invented many forms of missing 
data, including data augmentation, hid- 
den states, latent variables, potential out- 
come, and auxiliary variables. Purposely 
constructing unobserved /unobservable vari- 
ables offers an extraordinarily flexible and 
powerful framework for both scientific mod- 
eling and computation and is one of the 
central statistical contributions to natu- 
ral, engineering, and social sciences. In par- 
allel, much research has been devoted to 
better understanding and modeling of real- 
life missing-data mechanisms; that is, the 
unintended data selection process that pre- 
vents us from observing our intended data. 
This article is a very brief and personal 
tour of these developments, and thus nec- 
essarily has much missing history and ci- 
tations. The tour consists of a number of 
Ms, starting with a historic story of the 
mysterious method of McKendrick for an- 
alyzing an epidemic study and its link to 
the EM algorithm, the most popular and 
powerful method of the twentieth century 
for fitting models involving missing data 
and latent variables. The remaining Ms 
touch on theoretical, methodological and 
practical aspects of missing-data problems, 
highlighted with some common applica- 
tions in social, computational, biological, 
medical and physical sciences. 

No further discussion seems necessary because this 
is a point on which apparently most agree; indeed, 
almost all the positive comments on Lee and Nelder 
(1996) were on praising their promotion and formu- 
lation of models via unobservables. 

3.2 Line Two: H-likelihood for Fixed Parameter 

The authors' Section 3 is where I saw the defi- 
nition of h-likelihood for the first time. Using the 
authors' initial notation, y denotes observed data, 9 
is the fixed parameter, and v I infer is what the au- 
thors regarded ELS cl random "unobservable." The h- 
loglikelihood is simply defined as h(9, v) = log fo(y, v) 



where fo(y,v) is the joint probability distribution/ 
density of {y, v}. 

In the rejoinder of Lee and Nelder (1996), the au- 
thors argued that the definition of h-likelihood is as 
logical as Fisher's likelihood. I agree. In fact, this 
point was well recognized in Berger and Wolpert's 
(1988) monograph on likelihood principle (LP) where 
they wrote (page 21.2), ". . . the LP should be formu- 
lated in such a way that 6 consists of all unknown 
variables and parameters that are relevant to the 
statistical problem." (Emphasis is original.) They 
proceeded to devote an entire section to the suc- 
cesses and challenges in extending the LP to include 
what they call "unobservable variables," just as in 
the authors' formulation. In fact, in addition to the 
observable X, they wrote (pages 36-37) 9 = (z;u) = 
(y,w;£,r)), "where z = (y,w) is the value of an un- 
observable variable Z with y being of interest and 
w being a nuisance variable, and where oj = (£, 77) 
is the parameter that determines the distribution of 
both X and Z, with £ being of interest and 77 be- 
ing a nuisance parameter." This quote shows that 
Berger and Wolpert's (1988) definition is the same 
as the authors', other than it takes a more explicit 
form by recognizing two different kinds of unobserv- 
ables, y and w, just as we often distinguish between 
primary parameter £ and nuisance parameter 77. 

The key question here, therefore, is what to do 
with it once it is defined. I shall discuss this point 
in Section 5. Here it suffices to note that the au- 
thors' initial proposal to maximize h(9,v) jointly 
over {9,v}, which they label MHLE (maximum h- 
likelihood estimation) as in Section 2.2 of Lee and 
Nelder (1996), can clearly lead to grossly inconstant 
or even meaningless estimators if it is taken as a 
general procedure. This was pointed out by the ma- 
jority of the discussants of Lee and Nelder (1996); 
as the authors stated later in the rejoinder of Lee 
and Nelder (2005), "The discussion was a disaster 
because everybody took the worst possible case of 
binary data and described difficulties with it. No- 
body said it worked in other cases." The example 
of Bayarri et al. (1988), reviewed in authors' Sec- 
tion 4.2, demonstrated that the defect has little to 
do with binary data. 

Indeed, earlier Little and Rubin (1983) provided 
four examples, three using standard univariate or 
bivariate (regression) normal models and one with 
a censored exponential model, to show that MHLE 
(though of course not in that term since Little and 
Rubin, 1983 predates Lee and Nelder, 1996) resulted 
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in seriously flawed/inconsistent estimators, unless 
the amount of missing data is (asymptotically) neg- 
ligible. The underlying issue is essentially the same 
as with the well-known Neyman-Scott problem (Ney- 
man and Scott, 1948). The message here is loud and 
clear: maximizing over unobservable/missing data, 
in general, is not a valid method. 

Evidently, the message has been appreciated by 
the authors, as they now make it explicit that for 
the "fixed parameters," their method is the same 
as Fisher's MLE, that is, maximizing the marginal 
log likelihood £(6) = log fg(y). This certainly should 
help to avoid the type of mis-communications the 
authors described in the paper (e.g., about Rubin 
and Little's 2002 comments). But this also means 
that no further discussion is needed either because 
there is no new advance here. 

However, for the sake of discussion, let me pick 
up on the authors' statement that "We view the 
marginal likelihood as an adjusted profile likelihood 
eliminating nuisance unobservables v from the h- 
likelihood." The issue is not much of the re-labeling 
itself, but rather that by making such a statement, 
the authors might be in danger of falling into the 
same trap that they have correctly warned others 
to avoid. The authors' "adjusted profile h-likelihood 
(APHL)," as far as I am able to understand, simply 
uses a Laplace approximation to replace the inte- 
gration called for by Bayesian marginalization (for 
nuisance parameter /unobservables) . Whereas such 
an approximation indeed is very useful and appeal- 
ing for practical purposes when the approximation is 
reasonable, it does not constitute a principled sta- 
tistical method in its own right unless a sound infer- 
ential principle is articulated for the approximation 
itself. Without such a principle, its performance can 
only be judged by how close the approximation is to 
the Bayesian target it approximates. In this sense, 
comparisons such as those given in the authors' Fig- 
ure 2 say little about the merit of the h-likelihood 
methods, but only reconfirm the usefulness of the 
Laplace approximations, or demonstrate the impact 
of the prior (which of course is not a part of the 
h-likelihood formulation). In other words, mixing a 
computation/approximation method with a statisti- 
cal method is as troublesome to me as mixing an 
estimation method with a statistical model is to the 
authors (and to me of course). 

Enough painless/itchless quibbles; let us get to the 
heart of the authors' proposal, that is, making in- 
ference about the unobservables via h-likelihood! 



4. WHAT ARE THE PRINCIPLES BEHIND 
THE H-LIKELIHOOD METHODS? 

4.1 Distinguishing Likelihood Principle, 
Likelihood Inference, and MLE 

The authors invoked several times the likelihood 
principle (LP) to justify their h-likelihood methods. 
But all the LP says, broadly speaking, is that if two 
data sets lead to the same likelihood, then they con- 
tain the same information, assuming the underlying 
model for each data set is correctly specified. The LP 
eliminates any procedure that violates it, but it says 
nothing about how to conduct a likelihood inference. 
As Berger and Wolpert [(1988), Chapter 5] put it, 
"The LP strikes us as correct, and behaving in vi- 
olation of it would be a source of considerable dis- 
comfort. Yet the LP does not tell one what to do (al- 
though insisting on methods based on the observed 
likelihood function certainly reduces the possibili- 
ties)." 

Indeed, there is a long list of methods in the do- 
main of "utilization of the likelihood function," too 
long even for Berger and Wolpert's (1988) mono- 
graph. I shall avoid repeating Berger and Wolpert's 
argument that the full Bayesian inference is actu- 
ally the most principled likelihood inference, since 
clearly the authors' intention here is to achieve what 
Bayesian methods achieve but without adopting the 
Bayesian philosophy; or, to self quote again (Meng, 
2008), "enjoying the Bayesian fruits without paying 
the B-club fee." But it is worthwhile to re-emphasize 
that the notion of likelihood inference is a very elu- 
sive one — any method that does not violate LP can 
be legitimately included (see Berger and Wolpert, 
1988). 

In contrast, maximal likelihood estimation (MLE) 
is a well-defined method, telling us exactly what to 
do with the likelihood function. It is this specific 
method that the authors' MHLE mimics. The afore- 
mentioned counterexamples demonstrate clearly that 
in general this imitation is only mathematical. The key 
question then is whether it is possible to find a set of 
useful and general conditions under which the im- 
itation is more fundamental, that is, under which 
MHLE preserves the underlying properties of MLE 
that guarantee its validity and efficiency? The an- 
swer turns out to be an intriguing "yes and NO." 
But before we get to that punch line, we will need 
the wisdom of an old friend, Mr. Bartlett. 
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4.2 Do Bartlett Identities Hold for H-likelihood? 

Finding the most likely parameter value that could 
have produced the observed data is intuitively very 
appealing — what else could be better? But of course 
as statisticians we know such reasoning by itself is 
flawed, because it puts us squarely in the hands of 
the Devil of Overfitting! There is clearly much more 
to Fisher's MLE than this flawed intuition. 

Probabilistically, a backbone of Fisher's ML 
method is the Bartlett identities, especially the first 
two. That is, for the (marginal) log-likelihood £(6; y), 
under the usual regularity condition that the sup- 
port of fe(y) does not depend on 9 S G, we have 



(4.1) 



d£(9;y) 



09 



\/9 € 0, 



(4.2) 



d 2 m V ) 
d9 2 

Wee 



+ E e 



( d£(9;y) 
{ 89 



V 89 



where Eg denotes the expectation under fe(y)- Here 
identity (4.1) guarantees that the normal/score equa- 
tion underlying the MLE method, 



(4.3) 



s(d;y)= d0 = o, 



is an unbiased estimating equation. Identity (4.2) is 
the basis for the asymptotic efficiency of MLE (un- 
der regularity conditions, of course) because it re- 
duces the general "sandwich" variance formula to 
the inverse of Fisher information, the Cramer-Rao 
lower bound. 

For these reasons, generalizations of (maximal) 
likelihood methods have largely tried to preserve 
these two identities, such as with the quasi-likelihood 
method (e.g., McCullagh and Nelder, 1989, Chapter 
9); see Mykland (1994, 1999) for other examples. It 
is therefore difficult to imagine that the issue of pre- 
serving them has not been investigated in general in 
the context of h-likelihood, given it is essentially a 
minimal requirement; indeed, when Engle and Keen, 
the lead discussants of Lee and Nelder (1996), wrote, 
". . . the usual first- and second moment properties 
exactly hold for /i-scores, for example, for normal- 
normal and Poisson-Gamma models. . . " I believe 
they were referring to the two identities above. I 
therefore surmise that it is my haphazardly selec- 
tive reading that makes the existing investigations 
unobservable to me. So I must offer my apologies to 
anyone, especially the authors, if I am reinventing 



the wheel below. But in any case I hope the mate- 
rial presented in the rest of this discussion will help 
to establish a firmer theoretical ground for inves- 
tigating the virtues and limitations of MHLE and 
other h-likelihood methods. 

Specifically, as we all know, identities (4.1) and 
(4.2) are consequences of 



(4.4) 



fi(dy) 



fe(y)fi(dy) = 1 



We0 



by repeatedly differentiating under the integral sign 
with respect to 9, which is legitimate when the sup- 
port Sly is free of 9 (and assuming the usual continu- 
ous differentiability of l{9\ y) as a function of 9; such 
conditions will be assumed below whenever needed). 
For log /i-likelihood, h(9,v;y) = logfg(y,v), clearly 
we still have 



(4.5) 



,v 



e h ^' v ' y) fj,(dy,dv) 



j fe{y,v)»(dy,dv) = l V9ee. 

" »2 y . v 



However, whereas we can still take (partial) deriva- 
tives with respect to 9 on both sides of (4.5) to ar- 
rive at useful identities, obviously taking the deriva- 
tive of both sides with respect to v would produce 
= 0. This death of the old trick signifies a key dif- 
ference between the /i-likelihood and Fisher's likeli- 
hood, even if we put aside cases where v is discrete 
and hence taking derivatives is not even an option. 
Here we remark that unlike Fisher's likelihood where 
discrete parameters are rare (other than with model 
selection problems), discrete unobservable/missing 
data are common which poses an additional chal- 
lenge to the MHLE method. But clearly the authors' 
current proposal focuses on continuous v, so we will 
proceed in this setting. 

5. ENCOURAGING NEWS: H-LIKELIHOOD 
IS BARTLIZABLE 

5.1 Necessary and Sufficient Conditions for 
Bartlett Identities 

Without the old trick, we have to directly inves- 
tigate if and when (4.1) and (4.2) can be extended 
to /i-likelihood. Specifically, when we let (ft = {9,v}, 
and write 



(5.1) 



h(4>;y) = log f e (y\v) + log f e (v), 
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we see the "troublemaker" is the second term be- 
cause for the first term, v plays the same role of 
a fixed parameter for the conditional distribution 
fo(y\v), and hence the old trick of differentiating 
under integration is applicable. In particular, as an 
application of (4.1) and (4.2) when conditioning on 
v and assuming the support of fg(y\v) does not de- 
pend on either 9 or v, we have, for any 9 € G, 

~dlogfg(y\v) 



(5.2) 



Eg 



d 2 log fg(y\v) 



0. 



dlog f e (y\v)\ ( d\ogf g (y\v 



En 



(5.3) + Eg 



= 0. 

Consequently, under the additional assumption that 
the support of fe(v) does not depend on 9, (5.1) and 
(5.2) imply that, for any 9 e 0, 



E e 



dh(fcy) 



Eg 



d\ogf e {v) 



(5.4) 




Furthermore, noting that the cross terms in the 
quadratic expansion below are zero by first condi- 
tioning on v, we have from (5.1)-(5.3), 

T 



Eg 



(5.5) 



d 2 h(c(>;y) 



+ Eg 



( dh{<t>;y) \ ( dhjfry) 



Eg 



d 2 log fg(v) 



by applying (4.2) to log fg(v). For the term B, one 
can easily verify that 



B = Eg 



d 2 log fg(v) 



89 dv 



(5.7) 



+ Eg 



dlog fg(v)\ ( dlog fg(v 



09 



dv 



dlog fg(v) 
dv 



Wee, 



and hence it will also be zero if Eg[ dlog J v B ^ 
all 9 e 0. Finally, simple algebra shows 



for 



C = Eg 



(5.8) 



d 2 log fg(v) 
dv 2 

dlog fg(v)\ ( dlog fg{v 



+ 



dv 



dv 



d 2 fg(v) 
dv 2 



n{dv). 



Combining (5.4)-(5.8) yields the following straight- 
forward but key result. 

Theorem 1. Let h(<p;y) = logfg(y,v) be a log 
h-likelihood where <p = {9, v}, 9 e is the model pa- 
rameter, v is a continuous unobservable with density 
fg(v) with respect to a measure fi, and let Sg(v) = 
9log J^^ . Furthermore, assume the support of fg(y\v) 
does not depend on either 9 or v ( almost surely with 
respect to fi), the support of fg(v), denoted by Q v , is 
free of 9, and all continuity and differentiability con- 
ditions hold whenever needed. Then the first Bartlett 
identity holds for the h-likelihood, that is 



(5.9) 



Ea 



dh((p;y) 



V#€9 



if and only if 



+ Eg 



( d\ogfe(v) \ f dlogfg(v) \ T l (51Q) E [S ( v)] =[ WM^ dv)= o wee. 

\ dcf> J\ d<P J Jn v dv 



A B 
I B T C 



In the above expression, 
~d 2 log f e {v) 



Assuming (5.10), then the second Bartlett identity 
holds for the h-likelihood; that is, 

~d 2 h(<f>;y)~ 



A = Eg 



(5.6) 



Eg 



+ Eg 



d9 2 

dlog fg(v) 
d9 



dlog fg(v)\ T 



(5.11) 



+ Eg 



09 



'( dh{fry) \ ( dh(</>;y) 
\ dcf> J\ d(p 

V#e6 
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if and only if 

'dSo(v) 



(5.12) 



dv 



+ S e (v)Sj(v) 



d 2 fe(v) 

}„2 



n(d v ) = o V0 e e. 



5.2 Yes: It is Easy for H-likelihood to Produce 
"Un-sandwiched" Estimating Equation 

Theorem 1 is somewhat remarkable because the 
necessary and sufficient conditions (5.10) and (5.12) 
are determined purely by the marginal distribution 
of the unobservable v, and hence they are easy to 
check. For example, in Bayarri's example quoted by 
the authors, the marginal density of the unobserv- 
able u is exponential with mean A = 6" 1 . Conse- 
quently, Sg(u) = —6, and hence condition (5.10) is 
violated for all 6 > 0, as is condition (5.12). This 
means that whenever u is used for the h-likelihood, 
the resulting h-score will never form an unbiased es- 
timating equation, regardless of the model for fo{y\u)\ 
Indeed, we have seen from the authors' Section 4.2 
that the corresponding MHLE leads to meaningless 
estimates. 

In contrast, when we use v = logu, f$(v) = 9e v ~ SeV , 
and hence Sg(v) = 1 — Be" = 1 — u/X and S' g (v) + 
S$(v) = -9e v + {l-9u) 2 = -u/X + (u-X) 2 /X 2 . Both 
conditions (5.10) and (5.12) then follow trivially be- 
cause Eq(u) = X and Vq{u) = A 2 . Consequently, the 
authors' h-score is not only an unbiased estimating 
equation but also an "optimal" one in the sense that 
we do not need the usual "sandwich" formula, but 
only the Hessian matrix, for "valid" variance esti- 
mation. Unfortunately, I have to put both "optimal" 
and "valid" in quotes because of the bad news I will 
deliver in the next section. But as far as for preserv- 
ing Bartlett identities goes, which by itself does not 
guarantee valid statistical inferences, I can share the 
authors' optimism for the future of MHLE, espe- 
cially because of the following somewhat even more 
surprising result, which says that conditions (5.10) 
and (5.12) hold quite easily for many unobservables 
or their simple transformations. 

Theorem 2. Under the same setting as in The- 
orem 1, suppose the support of fe(v), U v C R d , takes 
a rectangle form, 0„ = nf=i [ a ji^i\> where aj or bj 
is permitted to take the value of +oo or —oo. Let 
dCl v be the boundary set of £l v (i.e., the set of all 
points whose coordinates contain at least one aj or 
bj), and assume the dominating measure fj, is the 
Lebesgue measure on R d . We then have: 



(I) If f$(v) = for all v 6 dQ, v , then condition 
(5.10) holds, and hence the first Bartlett identity 
(5.9) holds. 

(II) If in addition = also holds for all 

v € dVL v , then condition (5.12) holds, and hence the 
second Bartlett identity (5.11) holds. 

Proof. For (I), because of (5.10), if v is univari- 
ate, that is, if d = 1, then 



dfejv) 
dv 



dv 



hi 



dfe(v) 



ai 



(5.13) 



fe(h)-fe(ax) 
0. 



under our assumption that fe{v) vanishes on the 
boundary. For d > 1, we apply the same argument 
to each of the d integrations that form the leftmost 
vector in (5.13), that is, j n dv, k = 1, . . . , d, by 

integrating with respect to v^ first to conclude that 
it is zero for all 8. 
For (II), we first note that for any {k, s}, 



dv 



(5.14) 



d 2 fe(v) 
n v dv k dv s 

d fdf e (v) 



u v dv k V dv s 



dv. 



Hence, using the same argument as above but with 
fe{v) replaced by 9 q£ , we can conclude It, s = 
for all k, s = 1, . . . ,d. Consequently, condition (5.12) 
holds. □ 

What this result says is that as long as the marginal 
density of the unobservable v vanishes on the bound- 
ary of its support, the first Bartlett identity holds 
for h-likelihood. In addition, if its derivative also 
vanishes on the boundary, then the second Bartlett 
identity holds. This provides an even easier way to 
verify Bayarri's example. For the original unobserv- 
able u, fe{u) = 6e~ Su , with boundary points u = 
and it = oo. But since fg(0) = 9, the vanishing con- 
dition is violated as long as 9 > 0. In contrast, for 



v = logu, f e (v) 



jo-Be" 



, with boundary points v 



— oo and v = +oo. It is easy to see that fg(— oo) 
fg(+oo) =0 for all 8. Furthermore, since 



dfejv) 
dv 



9(e 



v-6e v 



2v-0e v 



)■ 



the derivative also vanishes at both v = — oo and 
v = oo. Therefore, both Bartlett identities hold for 
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h-likelihood when v = log u is used as the unobserv- 
able. For simplicity, we will label the process of find- 
ing a transformation that makes Bartlett identities 
hold Bartlization ("Bartlettlization" is too much of 
a tongue twister!). 

An astute reader may have noticed that I did not 
say that failing the vanishing condition is the rea- 
son for the failing of the Bartlett identities for the 
original scale u. The vanishing condition is suffi- 
cient, but not necessary. This can easily been seen in 
(5.13), which only requires fe(a>i) = fe(bi). Indeed, 
the Bartlett identity fails for the original scale u pre- 
cisely because fg{u = +00) = but fg(u = 0) = 9, 
and hence Eg[Sg(u)] = — 9 = —9, as verified di- 
rectly previously. A necessary and sufficient condi- 
tion via integration on <9Q„ is not hard to obtain, 
but it requires a bit more mathematical treatment 
than is needed for most practical applications, for 
which Theorem 2 is adequate. Here we just mention 
that we can generalize Theorem 2 by allowing £l v 
to be an arbitrary simply connected manifold in R d 
(i.e., a manifold with "no hole"), and then invoke 
the generalized Stokes' theorem (see Marsden and 
Tromba, 2003) to equate the integration of dw on 
Q, v to that of w on the boundary dVL v where w is a 
so-called d — 1 differential form which can be taken 
in terms of fg(v) or its derivative as needed. 

The authors stated in the rejoinder of Lee and 
Nelder (2005) that "We do not say that the cur- 
rent h-likelihood method will always perform the 
best, but we believe that it can always be modi- 
fied to give an improvement, as has been done with 
Fisher's likelihood method." I believe the alluded-to 
improvements lie in using higher order Bartlett iden- 
tities, such as the third identity for "Bartlett correc- 
tion" for the likelihood ratio tests (e.g., McCullagh, 
1987). Clearly Theorem 1 and Theorem 2 have their 
higher order generalizations, but it is already 9:14 
pm of the second Sunday. My teenagers' visit to Dr. 
Langdon is already postponed for another week, so 
I had better leave such generalizations to a future 
paper. More importantly, as much as I am enjoying 
discovering the "Bartlizability" of h-likelihood, I do 
not see a way to correct the more fundamental prob- 
lem described in the next section, which potentially 
makes "Bartlett-corrected h-likelihood" an exercise 
that is literally just a homework exercise. 



6. BAD NEWS AND A PUZZLE: 
FISHY OR FIDUCIAL? 

6.1 NO: It is Hard for log H-likelihood to be 
Summarizable Quadratically 

Having the Bartlett identities is only a part of 
the story. What it guarantees is that if the log h- 
likelihood can be approximated quadratically, then 
the mode and the Hessian matrix derived from it 
will provide an approximately correct estimator and 
its associated (inverse) variance. To examine this is- 
sue more clearly, let us mimic the formal asymptotic 
argument behind the estimating equation approach 
which relies on the expression 



(6.1) 



I-\9)S{<t>;y) + R, 



where <f> is the MHLE, S(<f>; y) = is the h- 

score, and Ih(9) is the h-likelihood extension of the 
expected Fisher information, the expected Hessian, 



(6.2) 



h(0) = E 6 



We emphasize here that unlike the original Fisher 
information, Ih(9) is not generally guaranteed to be 
positive definite (so I^ 1 (9) may not even exist) un- 
less condition (5.12) holds; see Section 7 for an ex- 
ample. 

Expression (6.1) by itself is tautological, because 
there is always an R to make it hold; in particular 
it can be derived from a remainder term in the Tay- 
lor expansion of S{4>;y) — S(4>;y). However, when 
R is (asymptotically) negligible, (6.1) allows us to 
conclude that the distribution of — can be ap- 
proximated by that of T(9; y) = I^ 1 (9)S((f>; y) which 
has mean zero when the first Bartlett identity holds 
and variance 1^(9) when the second Bartlett iden- 
tity holds. 

When h is a regular Fisher's likelihood, under reg- 
ularity conditions, the R term is asymptotically neg- 
ligible compared with the first term on the right- 
hand side of (6.1). A key reason for this is the ac- 
cumulation of information as we collect more data; 
eventually we will have zero uncertainty about the 
parameter, at least in theory. Unfortunately, for h- 
likelihood, this cannot be true in general even in 
theory because no matter how much data we accu- 
mulate, it cannot possibly eliminate the uncertainty, 
say, in predicting a future outcome, such as in the 
authors' Example 4. This lack of accumulation of 
information for unobservables is essentially the key 
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problem pointed out by multiple discussants (e.g., 
both lead discussants) of Lee and Nelder (1996), 
with both theoretical and empirical examples. 

Without the accumulation of information to jus- 
tify the central limit theorem or the law of large 
numbers, we actually will run into two problems 
with the standard asymptotic arguments for (6.1), 
even if the first two Bartlett identities hold. The most 
obvious and critical one is that since R is not negligi- 
ble, we cannot approximate the distribution of <f> — (ft 
by that of T(9;y) = I^ 1 (9)S((j>;y); indeed, without 
R being negligible, the MHLEs are not guaranteed 
to be consistent, as in all examples of Little and 
Rubin (1983). It is of critical importance to stress 
that the Bartlizable property of h-likelihood itself 
has little bearing on the issue of being quadratically 
summarizable, that is, the R term being negligible. 
Indeed, in all normal examples of Little and Ru- 
bin (1983), the h-likelihood is naturally Bartlized 
because clearly the normal density and any of its 
derivatives vanish on the boundary of its support, 
yet the MHLE produces inconsistent estimators be- 
cause of the nonnegligibility of the R term. The more 
subtle one is that regardless of whether R is negli- 
gible or not, we may not be able to justify the usual 
normal approximation T(9;y) ~ N(0,I^ 1 (9)), even 
if T(#; y) has mean zero and variance matrix I^ 1 (9). 
(Of course, when R is not negligible, the properties 
of T are not really relevant.) Section 7 will illustrate 
all these points via a simple but very informative ex- 
ample. 

6.2 And a Puzzle: The Meaning of the 
H-distribution 

Even if R is exactly zero and all Bartlett identi- 
ties hold, the h-likelihood method, as a method for 
predicting the unobservables v, still faces a funda- 
mental challenge. That is, what is the meaning of 
the resulting distribution f(v\y), which I shall term 
the h-distribution for obvious reasons? If one is will- 
ing to assume a constant prior on 9, then of course 
this has a Bayesian interpretation as a posterior pre- 
dictive distribution or an approximation to it. But 
the authors specifically emphasized that they did 
not want to specify a prior on 9, for their goal is 
to provide an alternative method to the Bayesian 
approach. 

Some Bayesians may be agitated by having a 
method that is mathematically or numerically equiv- 
alent, in general, to a Bayes method (perhaps un- 
der a particular prior), but is labeled as something 



else. I am much less troubled, provided that (1) the 
connection is clearly spelled out, and (2) there is 
a well-articulated non-Bayesian principle justifying 
the method. The authors clearly have done (1), but 
for (2) all I can find is authors' desire to conduct a 
probabilistic inference for v without having to spec- 
ify a prior for 9. At the conceptual level, I have the 
very same desire because of my frustration, which I 
am sure some share, with the apparent impossibility 
of constructing a truly "noninformative" prior (for 
continuous parameters, at least). I also very much 
appreciate the authors' emphasis that the "plug-in" 
empirical Bayes is not a satisfying method, precisely 
because "plug- in" is an ad hoc method. So indeed I 
was quite excited when I thought that the authors 
had found a way to meaningfully specify a proba- 
bilistic f(v\y), without considering 6 as a random 
variable. 

At a practical level, the authors did provide a 
number of "well-specified" h-distributions, either via 
(the implied) normal approximation with mean and 
variance obtained from the MHLE/Hessian matrix 
for v or the APHL approximation by profiling out 
9. But without spelling out the probabilistic mean- 
ing of such resulting distributions, it is essentially 
impossible to answer the criticism that the label 
of h-likelihood is a red herring because they are 
just approximations to Bayesian solutions instead 
of the products of a genuine competing method as 
claimed. More importantly, without knowing what 
"gold standard" they aim to approximate, we have 
no meaningful ways to evaluate how good the ap- 
proximations are, or even to specify a probabilistic 
evaluation mechanism; in what real or thought ex- 
periment can it be realized? 

Indeed, the lack of a distinct and justifiable mean- 
ing of the h-distribution apparently has put the au- 
thors in an awkward position in terms of demon- 
strating the merit of their methods. From the pa- 
pers I read, it appears that the authors have two 
kinds of comparisons. The first is to compare an h- 
distribution to a Bayesian one, and to "validate" 
the h-distribution by showing how close it is to the 
Bayesian counterpart. But this only strengthens the 
aforementioned "red herring" criticism, and provides 
evidence for — not against — the kind of statements 
made by my twin brother quoted previously. Clearly 
this is contrary to the authors' intention, and I be- 
lieve is part of the reasons for the continuing dis- 
crepancy between the authors' enthusiasm for and 
others' reluctance toward the h-likelihood methods. 



10 



X.-L. MENG 



The second type is something that I have not 
seen before, at least not in academic publications. 
The authors seem to take their methods as the stan- 
dard, and compared everything else to it, as sug- 
gested by the statement, "In the salamander data, 
among other methods considered, the MCEM of 
Vaida and Meng (2005) gives the closest estimates 
to the h-likelihood estimators." Such comparisons 
would be meaningful if the superiority of the h- 
likelihood results had already been demonstrated 
either by theoretical proof (e.g., optimality of some 
sort) or by a distinctive principle that is not sub- 
sumed or invalidated by accepted ones. But even in 
such cases, the value of this type of comparison is to 
demonstrate the performance of other methods, not 
the merit of the h-likelihood method itself. 

6.3 Fiducial Argument via Predictive 
Pivotal Quantity? 

As I tried in vain to form a thought experiment 
that would meaningfully define the h-distribution 
f(v\y) without slipping into the Bayesian mode, I 
looked hard into the authors' writings for clues about 
what they had in mind. The first clue came from 
Section 3.1 of Lee and Nelder (1996), where they 
showed that, in the context of the models they were 
investigating, a log h-likelihood expression in their 
(3.2) can be expanded into their expression (3.3) 
which is a quadratic term —(v — v)' D*{v — v)/2 plus 
a term that depends on y only (their v is the same 
as the v notation here). They then wrote, "Ignoring 
the constant term, which depends only on y and not 
on v, expressions (3.2) and (3.3) imply that 

v\y ~ Niv^*- 1 ) 

would be a good approximation for the distribution 
ofv\y." With apologies to the authors in case I mis- 
understand their notation or there was a misprint- 
ing, this reasoning smells either fishy or fiducial, de- 
pending on the meaning of "the distribution of v\y." 

First, if by "the distribution of v\y" is meant the 
sampling distribution of v given both y and 9, then 
the reasoning underlying the above statement would 
contain the elementary flaw of confusing a marginal 
distribution of X\ — X2 with the conditional distri- 
bution of X\ — X2 given X\. This is because, even if 
the normal approximation is justified, the quadratic 
term above is for the marginal distribution of v — v, 
as v and v, which is a function of y only, vary jointly 
according to f(y;v\9). [I switch the notation from 
fo(y; v) to f(y;v\9) to emphasize the conditioning on 



9, even though the latter notation may imply that 9 
is a variable being conditioned upon, something the 
authors' approach aims to avoid.] This marginal dis- 
tribution clearly is not the same, in general, as the 
conditional distribution f(v — v\v,9) or f(v — v\y,9) 
(note in general that these two distributions are also 
different unless y and v are independent given 9). 
This can be most clearly seen from (6.1) where all 
the distributional calculations are with respect to 
the joint distribution f{y;v\9), not the conditional 
distribution f(v\y;9). 

Of course, this is unlikely to be what the authors 
intended, since their goal is to capture v\y without 
conditioning on 9. But the notation f(v\y) has no 
definition or meaning under the authors' joint mod- 
eling specification f(y,v\9) because 9 is treated as 
fixed. This brings me to the second "smell," that is, 
the authors were invoking a fiducial- like argument, 
by implicitly defining their conditional h-distribution 
v \y as the sampling marginal distribution ofv — v un- 
der the joint distribution f(y,v\9), and getting rid of 
its dependence on 9 when v — v is (asymptotically) 
a predictive pivotal quantity, meaning that its distri- 
bution is free of any unknowns. We can also think 
of this way of eliminating the nuisance parameter 
9 for the purpose of prediction as seeking predictive 
ancillarity, that is, a function of both y and v whose 
distribution is free of 9. See the example in Section 7 
for an illustration. 

6.4 A Duality or Prestidigitation? 

The second piece of evidence from the authors' 
writing seems to confirm this interpretation. In the 
comparisons of their methods with the Empirical 
Bayesian method, they compared the Bayesian pos- 
terior predictive variance of v\y with the estimator 
obtained from the Hessian matrix. To make this 
comparison more explicit, let us denote r{9;y) = 
V(v\9; y) and e(9; y) = v(y) — E(v\9; y). Then by the 
law of iterated expectations (or the so-called EVE 
formula) and noting that v is determined by y, we 
have 

V(v\y) = V(v-v\y) 

(6.3) 

= E[r(9;y)\y} + V[e(9;y)\y], 

(6.4) V(v - v\9) = E[T(9;y)\9] + V[e{9;y)\9\. 

The authors' argument seems to implicitly rely on a 
"duality," that is, the two mean terms on the right- 
hand sides of (6.3) and (6.4) are (asymptotically or 
approximately) the same; so are the two variance 
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terms. That is, we can switch the required mean and 
variance calculations under f{y\9) in (6.3) to that 
under f(6\y) in (6.4). Fisher's fiducial argument, as 
far as I can understand, aimed to establish the va- 
lidity of this switching on its own without viewing it 
as an approximation to the Bayesian method (with 
a constant prior). There is nothing wrong with in- 
voking the fiducial argument (well, actually there 
is but it depends on who one asks); indeed there 
has been a recent surge of interest in it, especially 
in connection with the "generalized confidence" ap- 
proach [e.g., Hannig, Iyer and Patterson (2006) and 
Hannig (2009)]. Perhaps the authors' approach is 
the next step, that is, using the fiducial approach 
for prediction, not just for estimation. But without 
being told explicitly about this switching, a reader's 
reaction would be anybody's guess. A suspicion of 
prestidigitation? A deja vu feeling of reading De- 
ception Point instead of De Vinci Code? Or even 
worse, an accusation of the prosecutor's fallacy? 

Finally, even if we buy the fiducial argument, it 
does not follow that the left-hand side of (6.4) can 
be well approximated by (an appropriate element 
of) the inverse of the Hessian matrix because of the 
non-negligibility of the R term, as discussed before. 
The authors, of course, well recognized this, and 
hence invoked the APHL method to approximate 
(define?) the h-distribution f{v\y) instead of relying 
on the normal approximation. While this approach 
indeed "works well," in the authors' example and in 
the example I am about to present, I have to put 
"works well" in quotes when the success is judged 
by comparing how close the h-distribution is to the 
posterior predictive distribution under the constant 
prior. But I'd be happy to remove the quotation 
marks if the evaluation is based on the aforemen- 
tioned pivotal predictive framework, because that 
is a distinctive principle, regardless of whether one 
subscribes to it or not. 

7. SHOW AND TELL: ESTIMATION AND 
PREDICTION WITH EXPONENTIAL 
DISTRIBUTION 

To illustrate various general points made in Sec- 
tions 4-6, let us consider a simple case where the 
data are an i.i.d. sample y = {y±, . . . ,y n } from an 
exponential distribution with mean A with the unob- 
servable being u = y n +i, a future observation. This 
example is different from Bayarri's two-level expo- 
nential model because here we only have one level, 



as in the authors' Example 4. It is hard to have faith 
in a method for multi-level hierarchical models if it 
cannot handle single-level models. 

7.1 Why does the Original Scale Fail? 

As we discussed in Section 5.2, when the exponen- 
tial variable u = y n +i is used as the unobservable, 
the Bartlett identities fail. In the current setting, 
this can be seen directly by noting that (where y n 
denotes the sample mean of {yi, . . . ,y n }) 



(7.1) h(X,u;y) = -(n + I) log X 



ny n + u 
A ' 



which clearly does not have an internal mode be- 
cause it is linear in u > 0. Indeed, the h-score equa- 
tion, 



(7.2) 




leads to the meaningless estimator A = +oo. Inci- 
dently, this is also an example that lh(0), as defined 
in (6.2), is not nonnegative definite because the sec- 
ond Bartlett identity fails. Specifically, by further 
differentiating the expressions in (7.2), it is easy to 
verify that 



ny n + u 



h(9) 




A 2 




which clearly fails to be nonnegative definite. 

7.2 A Simple Transformation is All it Takes 

However, when the h-likelihood uses v = log(ti) as 
unobservable, it satisfies both conditions of Theo- 
rem 2 as verified in Section 5.2, so the correspond- 
ing h-likelihood is Bartlized. To see this directly, be- 
cause 



(7.3) h(X,v;y) = -(n + l)logX 



ny n + e v 
X 



+ v, 
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the h-score equation becomes 
dh 
d\ 
dh 

dv 



(7.4) 



n + 1 ny. 



+ e v 



X 2 



4- 



0. 



This delivers the correct MLE for A, A = y n , and a 
very sensible point prediction for the future obser- 
vation, u = e v = A = y n . 
Furthermore, the expected Hessian matrix is 



4(A) = E x 



(7.5) 



/ n + 1 _ ^ ny n + e l 



A 2 



A 3 



A 2 



V 



e L 
A 2 



/ n+1 
V 



A 2 
1 

~A 



It is easy to see that when evaluated at MLE 
(=MHLE), A, 1^ (A) is identical to the observed Hes- 
sian matrix I° bf 



h 



/n + 1 ny n + e v 



robs 

-if, = 



A 2 



A 3 



(7.6) 



V 



A 2 
"T7 



/n+1 



V 



1 

A 



where the equality holds because A = y n = e v . The 
fact that these two Hessian matrices coincide also 
gives us another indication that the MHLE/Hessian 
matrix can behave just like MLE/Fisher information 
for regular exponential families. 

7.3 So How Good is the Approximation? 

Now let us examine the inverse of I/j(A), 
/A 2 A 



(7-7) /^(A) 



V 



n 
A 



n 



1 + 



1 



_2 



n n 

If the R term in (6.1) is negligible, then the above 
matrix should provide the (asymptotic) value of 
V\(4> — 4>) where <\> = {X,v} and the variance opera- 
tor V\ is with respect to the joint sampling distri- 
bution f\(y,v). Clearly, r| = A 2 /n is exactly right 
because it is V\(\). To examine the other entries, 



we first recall that for large n, Taylor's expansion 
(i.e., the 5-method) justifies the approximation 

A 



(7.8) log(y n ) - log(A) 



Vr, 



A 



Adopting this approximation, and noting that v 
log(y n +i) is independent of y n given A, we have 



(7.9) 



Cova(A,-0 - v) = Cov A (y„,log(y n )) 

« Cov\{y n ,z n ) = - 
n 



which is the same as t\ >v . 

Similarly, by (7.8), V\(log(y n )) sa V(z n ) = 1/n, and 
hence we have 



(7.10) 



V x (v -v)= V x (log(y n )) + V x (log(y n+1 )) 

«- + y A (log(y n+ i)). 
n 



This would be the same as r 2 if Vx(log(y n+ i)) = 1. 
But unfortunately this is where the MHLE/Hessian 
matrix approximation breaks down. One can directly 
verify or use the property of Gumbel distribution 
(recall log of an exponential variable is a Gumbel 
variable) to arrive at 

7T 2 

(7.11) y A (log(y n+ i)) = — = 1.6449. . 

6 



which is considerably larger than 1. [Incidentally, the 
integrating moment generating function approach 
(Meng, 2005) can be used to calculate V\(log(y n )) 
exactly for general n, if needed.] 

7.4 So What Works and What Does Not? 

To see more clearly what went wrong, let us write 
out the R term in (6.1) explicitly for the current 
model. Using (7.4) and (7.7), simple algebra reveals 
that (6.1) becomes 



(7.12) 



,log(2M 

' _ Vn ~ A 
Vn ~ Vn+1 
A 



-A 

log(y n +i] 



+ 





Rv,n 



where R V)U obviously makes up the difference be- 
tween v — v and (y n — y n +i)/ A, but it would be more 
useful to express it in the equivalent form 

\ og (y_n\ ft. -A" 



R v 



(7.13) 



A 



log 



A 



Vn+l \ Vn+1 ~ A 



A 



A 
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From these expressions, we see that the MHLE/ 
Hessian matrix approach works perfectly for the es- 
timation of A — it is the same as MLE and with 
the correct variance estimator because its R term 
is exactly zero. However, for the prediction of v, 
two things went wrong, and both are due to the 
failure of accumulation of information. First, R V)U 
is not negligible compared with the leading term 
Z v ,n = {Vn ~ 2/n+i)/A. Indeed, as n -)• oo, R V)Tl -)• 
Roo = f — 1 — log(C) and Z v ^ n -+ = 1 - £ where 
£ is an exponential variable with mean one. In fact, 
while E(Z OQ ) = 0, E(R oa ) is far from zero, taking 
the value of Euler's constant, 7 = 0.5772.... This 
failure obviously is due to the nonapplicability of 
the Taylor expansion (7.8) when n = 1; if this were 
applicable, then V(log(y n +i)) = V(v) would be ap- 
proximated by V(z n ) = 1, leading to t% = 1 + — for 
V(v-v) in (7.7). 

Second, although Z^ has mean zero and vari- 
ance one, its density function f(z) = e z_1 , with sup- 
port (— oo, 1], is far from that of the normal. Indeed, 
/(1)/0(1) > 5, where <f>(z) is the p.d.f. of N(0,1). 
But of course the distribution of Zoo or Z v n is not 
even relevant because we cannot use either of them 
to approximate the sampling distribution of v — v 
due to the nonnegligibility of R v ^ n . 

7.5 3-in-l: Pivotal Predictive Distribution, 
Posterior Predictive Distribution, and 
H-distribution 

The exact distribution of v — v, of course, can 
be worked out easily in this case. But it is impor- 
tant to emphasize that by moving from the orig- 
inal u = yn+i scale to the v = log(y n +i) scale, we 
have obtained a predictive pivotal quantity. That 
is, whereas the sampling distribution of u — u = 
y n +i — Vn depends on the unknown A, the distri- 
bution of v — v = log(y n+ i/y n ) is free of A because 
it is canceled in the ratio as the scale parameter. 
Consequently, the v scale provides us a way to con- 
struct exact prediction intervals without having to 
worry about A which is a nuisance parameter for 
the purposes of prediction. This is simply the pre- 
dictive version of the usual inference of parameter 
of interest based on a pivotal quantity. Although 
such a construction is by nature a frequentist one, 
it should help to understand the importance of the 
choice of scale of the unobservables for the authors' 
approach. Evidently, this consideration of pivotal 
quantity greatly restricts the family of scales for 
unobservables, beyond the minimal requirement of 



preserving the (first two) Bartlett identities, as dis- 
cussed in Section 5. 

Indeed, it is informative to compare the three dis- 
tributions here: (I) the sampling distribution f\(v — 
v), (II) the posterior predictive distribution f B {v\y) 
under constant prior and (III) the h-distribution 
f H (v\y) derived from the authors' APHL method. 
For (I), because U n = Y17=l Vi ~ Gamma(n, A) is in- 
dependent of u = y n +i ~ Gamma(l, A), we know the 
ratio B n = U n / (U n + it) is distributed as Beta(n, 1). 
Consequently, r = y n +i/]Jn = n {B~ l — 1) follows a 
Pareto distribution of order n + 1, that is, 



(7.14) f{r) 



1 + 



n 



-(n+l) 



r > 



which converges to e~ r as n — > oo, as it should. 
[The distribution f(r) obviously determines the dis- 
tribution of f — v = log(r).] 

In comparison, for (II), because f(yi, . . . ,y n \X) oc 
\~ n e~ u "/ x , a posteriori we can write A = C/ n 7 _1 , 
where 7 ~ Gamma(n — 1, 1). Consequently, because 
u = A£ where £ ~ Gamma(l, 1) and is independent 
of 7, a posteriori we have it = U n (^/'y). This implies 
r = nu/U n = n£/7 = n{B n _\ — 1) where B n -\ ~ 
Beta(n — 1,1); here we assume n > 1 as the pos- 
terior is improper when n = 1 under the constant 
prior on A. It follows that 



(7.15) f B {r\y) 



n 



-If r 
— 1 + - 

n V n 



r > 0. 



For (III), we note from the first equation of (7.4) 
that for any given v, the h- likelihood is maximized 
at 



(7.16) 



X(v) 



ny n + e l 
n + l 



From (7.3), the log profile h-likelihood then becomes, 
ignoring irrelevant constants, 

(7.17) h x (v;y) = -(n + l) log X(v)+v. 

Using the authors' notation and (7.5), D(h,X) = 
_d_h^Ly)_ _ ^ n _j_ iy^2 -whgn \ — X(v), and hence 
the authors' (log) adjusted profile h-likelihood be- 
comes, again ignoring irrelevant constants, 

h\(v; y) = -(n + l) log X(v) + v 

(7.18) - \\og{D(KX(v))) 
= — ralog A(i>) + v. 
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The h-distribution for v then, as I understand from 
the authors' approach, is to set 

f H (y\y) oc extort 

(7.19) 

= e v X~ n (v)oce v (U n + e v )- n . 

Converting this to the distribution of r = nu/U n = 
ne v /U n and re-normalizing it to be a proper distri- 
bution, we have, again assuming n > 1, 

(7.20) f H {r\y) = — + , r>0 

n \ n J 

which is identical to the posterior predictive distri- 
bution (7.15). This is expected because of the ac- 
curacy of the Laplace approximation (and by re- 
normalizing we eliminate the remaining approxima- 
tion inaccuracy). 

7.6 The Need of Choosing the Right Scale for 
the Fixed Parameter 

A perceptive reader may realize that the small dif- 
ference between (7.14) and (7.15) or (7.20), although 
of little practical consequence, nevertheless points to 
a deeper issue. Indeed, if we use the constant prior 
on log(A), the most common "noninformative" prior 
for scale parameter, then f B (r\y) will be the same 
as /(r) of (7.14). This suggests an intimate connec- 
tion between posterior prediction and the pivotal 
approach on the joint space of {y,v}. 

For h-likelihood, we have seen that choosing the 
right scale for the unobservable is crucial. However, 
the scale of the parameter also plays a role, espe- 
cially for the adjusted profile h-likelihood because 
the value of D(h,a) depends on the scale of a. For 
example, in the current example, if we also choose 
the log scale for A, that is, use h(r],v;y) to carry out 
all the h-likelihood calculations where r] = log(A), 
then D(h, rj) = n + 1. Consequently, the adjustment 
becomes immaterial, making the log APHL the same 
as (7.17), the original profiled log h-likelihood. This 
is easily seen to lead to 

(7.21) f H (r\y)=ll + -\ , r > 0, 

which is now identical to the pivotal predictive dis- 
tribution f(r) in (7.14), a truly 3-in-l! 

This equivalence not only demonstrates the inti- 
mate connection among the three methods, but also 
suggest the possibility of providing a probabilistic 
meaning to h-distributions, at least in some cases. 



For example, under (7.14), a 1 — a highest density 
predictive (HDP) interval is of the form 

(7.22) HDP=[0, c(a,n)y n ], 

where c(a,n) =n(a~ 1 / n — 1) — > — log(a). 

This interval has both Bayesian interpretation and 
frequentist interpretation, the latter of which I be- 
lieve is closer to what the authors have been seeking. 
The frequentist interpretation is simply that among 
repeated samples of {y\, . . . , y n , y n +i}, the HDP in 
(7.22) covers y n +i with frequency/probability 1 — 
a. Such interpretation perhaps is more appealing 
to some than its posterior predictive interpretation 
which in this case is actually not directly realizable 
with random A because it is derived under the im- 
proper prior 7r(A) oc A" 1 . It is somewhat intriguing 
that this un-realizable posterior predictive distribu- 
tion via random A is easily realizable via the pivotal 
predictive distribution. A general investigation of 
this connection may offer new insights into both the 
similarities and differences between Bayesian and 
sampling inferences. 

8. EPILOGUE 

Dan Brown concluded Angels and Demons with 
Dr. Langdon's religious experience with Vittoria, 
a yoga master. Although my pleasure is at an en- 
tirely different level, I must confess that my study 
of the h-likelihood framework is largely carried by 
both the authors' faith in their methods and my 
faith in the authors — they must have seen signs that 
most discussants did not. My Bayesian half urged 
me every weekend to seek Dr. Langdon's ambigram 
of "H," yet my other half kept seducing me with 
promises of hidden treasures. Indeed, a posteriori 
I am willing to move all probability from (V) to 
(IV), as well as to increase the probability of (II) 
over 50%, provided that we are always mindful of 
another "H" for h-likelihood — its Achilles' Heel — 
the potential (and often) non-negligibility of the R 
term. The Bartlizability and pivotal predictive inter- 
pretation of the h-likelihood methods could seduce 
someone to speculate that the "H" is The Lost Sym- 
bol, the eagerly awaited new thriller of Dan Brown. 
As a matter of fact, since I have already been se- 
duced for the past five weekends, far exceeding the 
originally planned 3-day excursion, I may as well en- 
joy my earned fantasy, a spoonful of my colleague 
Dr. Langdon's new experience, divine or not. . . . 
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